257 research outputs found

    Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation

    Full text link
    We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: 1) computing a sum given arbitrary filter conditions and 2) identifying the frequent items or heavy hitters in a data set. For the former, the sketch provides unbiased estimates with state of the art accuracy. It handles the challenging scenario when the data is disaggregated so that computing the per unit metric of interest requires an expensive aggregation. For example, the metric of interest may be total clicks per user while the raw data is a click stream with multiple rows per user. Thus the sketch is suitable for use in a wide range of applications including computing historical click through rates for ad prediction, reporting user metrics from event streams, and measuring network traffic for IP flows. We prove and empirically show the sketch has good properties for both the disaggregated subset sum estimation and frequent item problems. On i.i.d. data, it not only picks out the frequent items but gives strongly consistent estimates for the proportion of each frequent item. The resulting sketch asymptotically draws a probability proportional to size sample that is optimal for estimating sums over the data. For non i.i.d. data, we show that it typically does much better than random sampling for the frequent item problem and never does worse. For subset sum estimation, we show that even for pathological sequences, the variance is close to that of an optimal sampling design. Empirically, despite the disadvantage of operating on disaggregated data, our method matches or bests priority sampling, a state of the art method for pre-aggregated data and performs orders of magnitude better on skewed data compared to uniform sampling. We propose extensions to the sketch that allow it to be used in combining multiple data sets, in distributed systems, and for time decayed aggregation

    An O(n^3)-Time Algorithm for Tree Edit Distance

    Full text link
    The {\em edit distance} between two ordered trees with vertex labels is the minimum cost of transforming one tree into the other by a sequence of elementary operations consisting of deleting and relabeling existing nodes, as well as inserting new nodes. In this paper, we present a worst-case O(n3)O(n^3)-time algorithm for this problem, improving the previous best O(n3logn)O(n^3\log n)-time algorithm~\cite{Klein}. Our result requires a novel adaptive strategy for deciding how a dynamic program divides into subproblems (which is interesting in its own right), together with a deeper understanding of the previous algorithms for the problem. We also prove the optimality of our algorithm among the family of \emph{decomposition strategy} algorithms--which also includes the previous fastest algorithms--by tightening the known lower bound of Ω(n2log2n)\Omega(n^2\log^2 n)~\cite{Touzet} to Ω(n3)\Omega(n^3), matching our algorithm's running time. Furthermore, we obtain matching upper and lower bounds of Θ(nm2(1+lognm))\Theta(n m^2 (1 + \log \frac{n}{m})) when the two trees have different sizes mm and~nn, where m<nm < n.Comment: 10 pages, 5 figures, 5 .tex files where TED.tex is the main on

    Reconstructing David Huffman's Origami Tessellations

    Get PDF
    David A. Huffman (1925–1999) is best known in computer science for his work in information theory, particularly Huffman codes, and best known in origami as a pioneer of curved-crease folding. But during his early paper folding in the 1970s, he also designed and folded over a 100 different straight-crease origami tessellations. Unlike most origami tessellations designed in the past 20 years, Huffman's straight-crease tessellations are mostly three-dimensional, rigidly foldable, and have no locking mechanism. In collaboration with Huffman's family, our goal is to document all of his designs by reverse-engineering his models into the corresponding crease patterns, or in some cases, matching his models with his sketches of crease patterns. Here, we describe several of Huffman's origami tessellations that are most interesting historically, mathematically, and artistically.National Science Foundation (U.S.) (Origami Design for Integration of Self-assembling Systems for Engineering Innovation Grant EFRI-1240383)National Science Foundation (U.S.) (Expedition Grant CCF-1138967

    On the Structure of Equilibria in Basic Network Formation

    Full text link
    We study network connection games where the nodes of a network perform edge swaps in order to improve their communication costs. For the model proposed by Alon et al. (2010), in which the selfish cost of a node is the sum of all shortest path distances to the other nodes, we use the probabilistic method to provide a new, structural characterization of equilibrium graphs. We show how to use this characterization in order to prove upper bounds on the diameter of equilibrium graphs in terms of the size of the largest kk-vicinity (defined as the the set of vertices within distance kk from a vertex), for any k1k \geq 1 and in terms of the number of edges, thus settling positively a conjecture of Alon et al. in the cases of graphs of large kk-vicinity size (including graphs of large maximum degree) and of graphs which are dense enough. Next, we present a new swap-based network creation game, in which selfish costs depend on the immediate neighborhood of each node; in particular, the profit of a node is defined as the sum of the degrees of its neighbors. We prove that, in contrast to the previous model, this network creation game admits an exact potential, and also that any equilibrium graph contains an induced star. The existence of the potential function is exploited in order to show that an equilibrium can be reached in expected polynomial time even in the case where nodes can only acquire limited knowledge concerning non-neighboring nodes.Comment: 11 pages, 4 figure

    Beyond Worst-Case Analysis for Joins with Minesweeper

    Full text link
    We describe a new algorithm, Minesweeper, that is able to satisfy stronger runtime guarantees than previous join algorithms (colloquially, `beyond worst-case guarantees') for data in indexed search trees. Our first contribution is developing a framework to measure this stronger notion of complexity, which we call {\it certificate complexity}, that extends notions of Barbay et al. and Demaine et al.; a certificate is a set of propositional formulae that certifies that the output is correct. This notion captures a natural class of join algorithms. In addition, the certificate allows us to define a strictly stronger notion of runtime complexity than traditional worst-case guarantees. Our second contribution is to develop a dichotomy theorem for the certificate-based notion of complexity. Roughly, we show that Minesweeper evaluates β\beta-acyclic queries in time linear in the certificate plus the output size, while for any β\beta-cyclic query there is some instance that takes superlinear time in the certificate (and for which the output is no larger than the certificate size). We also extend our certificate-complexity analysis to queries with bounded treewidth and the triangle query.Comment: [This is the full version of our PODS'2014 paper.

    A PTAS for planar group Steiner tree via spanner bootstrapping and prize collecting

    Get PDF
    We present the first polynomial-time approximation scheme (PTAS), i.e., (1 + ϵ)-approximation algorithm for any constant ϵ > 0, for the planar group Steiner tree problem (in which each group lies on a boundary of a face). This result improves on the best previous approximation factor of O(logn(loglogn)O(1)). We achieve this result via a novel and powerful technique called spanner bootstrapping, which allows one to bootstrap from a superconstant approximation factor (even superpolynomial in the input size) all the way down to a PTAS. This is in contrast with the popular existing approach for planar PTASs of constructing lightweight spanners in one iteration, which notably requires a constant-factor approximate solution to start from. Spanner bootstrapping removes one of the main barriers for designing PTASs for problems which have no known constant-factor approximation (even on planar graphs), and thus can be used to obtain PTASs for several difficult-to-approximate problems. Our second major contribution required for the planar group Steiner tree PTAS is a spanner construction, which reduces the graph to have total weight within a factor of the optimal solution while approximately preserving the optimal solution. This is particularly challenging because group Steiner tree requires deciding which terminal in each group to connect by the tree, making it much harder than recent previous approaches to construct spanners for planar TSP by Klein [SIAM J. Computing 2008], subset TSP by Klein [STOC 2006], Steiner tree by Borradaile, Klein, and Mathieu [ACM Trans. Algorithms 2009], and Steiner forest by Bateni, Hajiaghayi, and Marx [J. ACM 2011] (and its improvement to an efficient PTAS by Eisenstat, Klein, and Mathieu [SODA 2012]. The main conceptual contribution here is realizing that selecting which terminals may be relevant is essentially a complicated prize-collecting process: we have to carefully weigh the cost and benefits of reaching or avoiding certain terminals in the spanner. Via a sequence of involved prize-collecting procedures, we can construct a spanner that reaches a set of terminals that is sufficient for an almost-optimal solution. Our PTAS for planar group Steiner tree implies the first PTAS for geometric Euclidean group Steiner tree with obstacles, as well as a (2 + ϵ)-approximation algorithm for group TSP with obstacles, improving over the best previous constant-factor approximation algorithms. By contrast, we show that planar group Steiner forest, a slight generalization of planar group Steiner tree, is APX-hard on planar graphs of treewidth 3, even if the groups are pairwise disjoint and every group is a vertex or an edge

    Network Creation Games: Think Global - Act Local

    Full text link
    We investigate a non-cooperative game-theoretic model for the formation of communication networks by selfish agents. Each agent aims for a central position at minimum cost for creating edges. In particular, the general model (Fabrikant et al., PODC'03) became popular for studying the structure of the Internet or social networks. Despite its significance, locality in this game was first studied only recently (Bil\`o et al., SPAA'14), where a worst case locality model was presented, which came with a high efficiency loss in terms of quality of equilibria. Our main contribution is a new and more optimistic view on locality: agents are limited in their knowledge and actions to their local view ranges, but can probe different strategies and finally choose the best. We study the influence of our locality notion on the hardness of computing best responses, convergence to equilibria, and quality of equilibria. Moreover, we compare the strength of local versus non-local strategy-changes. Our results address the gap between the original model and the worst case locality variant. On the bright side, our efficiency results are in line with observations from the original model, yet we have a non-constant lower bound on the price of anarchy.Comment: An extended abstract of this paper has been accepted for publication in the proceedings of the 40th International Conference on Mathematical Foundations on Computer Scienc

    Reflections on Tiles (in Self-Assembly)

    Full text link
    We define the Reflexive Tile Assembly Model (RTAM), which is obtained from the abstract Tile Assembly Model (aTAM) by allowing tiles to reflect across their horizontal and/or vertical axes. We show that the class of directed temperature-1 RTAM systems is not computationally universal, which is conjectured but unproven for the aTAM, and like the aTAM, the RTAM is computationally universal at temperature 2. We then show that at temperature 1, when starting from a single tile seed, the RTAM is capable of assembling n x n squares for n odd using only n tile types, but incapable of assembling n x n squares for n even. Moreover, we show that n is a lower bound on the number of tile types needed to assemble n x n squares for n odd in the temperature-1 RTAM. The conjectured lower bound for temperature-1 aTAM systems is 2n-1. Finally, we give preliminary results toward the classification of which finite connected shapes in Z^2 can be assembled (strictly or weakly) by a singly seeded (i.e. seed of size 1) RTAM system, including a complete classification of which finite connected shapes be strictly assembled by a "mismatch-free" singly seeded RTAM system.Comment: New results which classify the types of shapes which can self-assemble in the RTAM have been adde

    Augmenting graphs to minimize the diameter

    Full text link
    We study the problem of augmenting a weighted graph by inserting edges of bounded total cost while minimizing the diameter of the augmented graph. Our main result is an FPT 4-approximation algorithm for the problem.Comment: 15 pages, 3 figure

    The Power of Duples (in Self-Assembly): It's Not So Hip To Be Square

    Full text link
    In this paper we define the Dupled abstract Tile Assembly Model (DaTAM), which is a slight extension to the abstract Tile Assembly Model (aTAM) that allows for not only the standard square tiles, but also "duple" tiles which are rectangles pre-formed by the joining of two square tiles. We show that the addition of duples allows for powerful behaviors of self-assembling systems at temperature 1, meaning systems which exclude the requirement of cooperative binding by tiles (i.e., the requirement that a tile must be able to bind to at least 2 tiles in an existing assembly if it is to attach). Cooperative binding is conjectured to be required in the standard aTAM for Turing universal computation and the efficient self-assembly of shapes, but we show that in the DaTAM these behaviors can in fact be exhibited at temperature 1. We then show that the DaTAM doesn't provide asymptotic improvements over the aTAM in its ability to efficiently build thin rectangles. Finally, we present a series of results which prove that the temperature-2 aTAM and temperature-1 DaTAM have mutually exclusive powers. That is, each is able to self-assemble shapes that the other can't, and each has systems which cannot be simulated by the other. Beyond being of purely theoretical interest, these results have practical motivation as duples have already proven to be useful in laboratory implementations of DNA-based tiles
    corecore